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Abstract 

Inhomogeneous dynamical mean-field theory has been employed to solve many interesting 
strongly interacting problems from transport in multilayered devices to the properties of ultra- 
cold atoms in a trap. The main computational step, especially for large systems, is the problem of 
calculating the inverse of a large sparse matrix to solve Dyson's equation and determine the local 
Green's function at each lattice site from the corresponding local self-energy. We present a new 
efficient algorithm, the Lanczos-based low-rank algorithm, for the calculation of the inverse of a 
large sparse matrix which yields this local (imaginary time) Green's function. The Lanczos-based 
low-rank algorithm is based on a domain decomposition viewpoint, but avoids explicit calcula- 
tion of Schur complements and relies instead on low-rank matrix approximations derived from the 
Lanczos algorithm, for solving the Dyson equation. We report at least a 25-fold improvement of 
performance compared to explicit decomposition (such as sparse LU) of the matrix inverse. We 
also report that scaling relative to matrix sizes, of the low-rank correction method on the one hand 
and domain decomposition methods on the other, are comparable. 

PACS numbers: 31.15.xr, 31.15.aq, 71.10.Fd 
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INTRODUCTION 



The Dyson equation is at the heart of many important physical problems, taking its 



roots in many-body formalisms for the Green's function 



2|. The real-space version of the 



Dyson equation is of particular importance in the framework of inhomogeneous dynamical 
mean- field theory (IDMFT) js-Sj. In this framework, the solution of the Dyson equation, 
for a given lattice (e.g., square or cubic "optical" lattice), is being matched self-consistently 
to the solution of an impurity problem. A fixed-point iteration ensures that the two local 
Green's functions (for the lattice and for the impurity problem) are equal to each other. 
The self-energy, Sj, defined at each lattice site i, and introduced into the Dyson equation, 
is used for numerically matching the two problems. Details on the IDMFT algorithm can 



be found in 



4m- 



This paper focuses primarily on the first part of the IDMFT numerical method, the Dyson 
equation solver, which constitutes the most computationally intensive part of the IDMFT 
algorithm. Details on the second part, the impurity solvers, applied to the Fermi-Bose 



Falicov-Kimball model 



10Nl3|. can be found in Ref. 



a, Isl, |9| , as it is used for our numerical 
simulations shown below. 

The matrix form of the Dyson equation (along the imaginary time axis) used in any 
IDMFT implementation, is given by |2|, [j] : 

Gij{iujk) = {[Gijiiuk, U = 0)]~^ - Tji{iuJk)dij}~^ = {liij{iuJk)}~^ . (1) 

The matrix [Gij{iuJk', U = 0)] corresponds to the noninteracting Green's function, whose 
inverse includes a kinetic- energy term representing the nearest-neighbor hopping from site 
j to site i on a square (or cubic) lattice, plus the Matsubara frequency along the imaginary 
time axis, iuk, on the diagonal Hu, and the local potential which involves the sum of the 
chemical potential minus the local trapping potential; that is, we have [Gij{iujk; U = 0)]^^ = 
tij + {iuk+iJ'—Vi)6ij. It is noninteracting in the sense that it excludes the many-body potential 
energy terms, represented instead by the self-energy, T,i{iuk)- Note that the self-energy varies 
from site to site, and is therefore inhomogeneous [5| . The explicit values of 'EAiujj^ on each 
lattice site i are deduced from an impurity solver, not described here ay , [iSl, [ij] for each 
lattice site. For convenience, we have defined H in eq. ([I]) as, G = {H}~\ to be used when 
describing the new algorithm below. The matrix H corresponds to a lattice model, and it is 
non-singular, of rank N. It is a complex-symmetric matrix, with complex-diagonal elements 



and real-valued off-diagonal elements, called hopping terms. A similar expression for the 
Dyson equation can also be deduced along the real frequency axis, iuk oo + i5 {4]. Note 
that the off-diagonal elements in the matrix H are extremely sparse because the hopping is 
restricted to nearest neighbors, so there are typically only Z nonzero terms, where Z is the 
coordination number of the lattice (4 for a square lattice and 6 for a simple cubic lattice). 
In IDMFT, the quantity of interest is essentially the diagonal of the inverse, Gjj, which is 
then matched to the impurity Green's function [9j inside a self-consistency loop. Therefore, 
the main task of the numerical method developed below is to determine 

diag{G) = diag ((H)^"*^) 

from eq. ([1]). 



THE LANCZOS-BASED LOW-RANK CORRECTION ALGORITHM 

The new algorithm that solves the diagonal of the inverse of the matrix in eq. ([1]) essen- 
tially combines two general approaches of solvers, one iterative and one direct. Those two 



approaches are: the Lanczos scheme 



15l | and the domain decomposition [16|]. We first briefly 
recall the Lanczos algorithm and its important elements needed for the rest, then show an 
important example of domain decomposition, and describe finally the new Lanczos-based 
low-rank correction algorithm based on this example of domain decomposition. 

Standard Lanczos 

The standard Lanczos algorithm is a special case of the Arnoldi algorithm applied to 
real-symmetric or Hermitian matrices \\\- Let assume first that H is real-symmetric. The 
fundamental 3-terms recurrence of the Lanczos algorithm is given by [l^ 

= Hq^ - a^q^ - /^jq^^i, forj = 1, . . . , A^. 

This recurrence requires only one "matrix-vector" operation per Lanczos iteration, corre- 
sponding to the following operation: 

^ Hq,, (2) 
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where Uj is a temporary variable holding the resulting matrix- vector product. In matrix 
form, the Lanczos recurrence gives rise to a tridiagonal matrix Tj = tridiag{Pj,aj, /3j+i), 
at the j^^ Lanczos iteration. Once j = N, the rank of the matrix H, we obtain the exact 
relation 

H = QjyTjvQ^. (3) 
It would be possible in practice to determine from this above matrix decomposition (See 



for example Ref. 



19|). However, for very large this procedure is prohibitive, due to the ne- 



cessity to use and store all N Lanczos vectors, in either a partial or full re-orthogonalization 



18| implementation. 



As 



ust stated, this Lanczos scheme applies only to real-symmetric (or Hermitian) ma- 



trices 18|. When H is complex-symmetric, as in IDMFT, the standard Lanczos scheme 



needs to be modified. Freund et al. 



20| have shown that applying the Lanczos algorithm to 



complex-symmetric matrices can be done by replacing everywhere in the algorithm (includ- 
ing re-orthogonalization) the usual inner product by an indefinite inner product defined as 
a transposition only, excluding the conjugation, i.e.: (q, u)c = q^u, where q and u are any 
two complex vectors. The corresponding indefinite "norm" (inside quotation marks, because 



it does not satisfy fundamental properties of a norm) is: \\q\\c = \/ (q, q)c) the principal 
square root. Note that the resulting indefinite "norm" can be a complex number, and thus 
result in a possible breakdown, or division by zero, in the algorithm (when 1 1^1 |c may become 
null for two non-null vectors). This potential breakdown is taken care of by allowing for 
a few extra iterations, making sure that the null space can be reached within a sufficient 
number of Lanczos steps, as further described below in the context of the Lanczos-based 
low-rank algorithm. 



Standard Domain Decomposition 



Domain decomposition is a very broad method [16|. We describe here only one applica- 
tion, used next in the Lanczos-based low-rank algorithm. Consider now two different site 
labelings of a square lattice depicted in Fig. [1] A straightforward labeling of the nearest- 
neighbor hopping matrix, as part of the local Green's function Gij{U = 0) for lattice sites 
j to i of eq. ([I]), is shown on the left of Fig. [H where lattice sites are labeled consecu- 
tively from bottom-left to top-right. This labeling corresponds to a unique domain. For 



4 



Full domain 



Domain decomposition 



• 

21 


• 

22 


• 

23 


• 

24 


• 

25 




• 

11 


• 

12 


• 

25 


• 

15 


• 

16 


• 

16 


• 

17 


• 

IS 


• 

19 


• 

20 




e 

9 


9 

10 


• 

24 


9 

13 


9 

14 


• 

11 


• 

12 


• 

13 


• 

14 


• 

15 


• 

19 


• 

20 


• 

21 


• 

22 


• 

23 


6 


7 


8 


9 


10 




3 


4 


18 


7 


8 


1 


2 


3 


4 


5 




• 

1 


• 

2 


• 

17 


• 

5 


• 

6 



FIG. 1: A 2-dimensional example of a domain, J7, discretized with n = 5^ equidistant grid points. 
The left part shows a conventional ordering of grid points; the right side shows a domain decom- 
position ordering, where grid points in the four subdomains are numbered first, followed by the 
interface points. 



hopping occuring only between nearest-neighbors, this straightforward numbering leads to 
a pent a- diagonal matrix (similarly to a 5-point Laplacian operator), in two-dimensions. Of 
course, a multitude of different labelings can be constructed. Consider another important 
example, shown on the right of Fig. [H where a decomposition into four subdomains of the 
nearest-neighbor lattice sites has been devised. The corresponding matrix H, containing 
four subdomains, is no longer penta-diagonal, but has now the following form: 
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The four blocks, Bj, i = 1, . . . 
similarly for the E (and E"^): 



The Schur complement 



4, can then be grouped into a block diagonal matrix B, and 



H 



B E 

E^ C 



C - E^B^E 



(4) 



(5) 



is then defined such that the matrix in eq. (HI) can be re- written as a product of lower and 

n 

upper block-triangular matrices p/7|, as: 



H 




where I is the identity matrix of proper dimension. The rank m of the Schur complement 
matrix S is equal to the number of interface points. In the example of Fig. [T]on the right, 
the interface points labeled from 17 to 25 give a rank m = 9 for S, for an entire matrix H 
of rank = 25. 

The main expression giving the inverse of H based on the domain decomposition of Fig. [H 
can now be deduced from the above product of matrices. The resulting inverse takes the 
form: 






S-^ ((E^B-i) - I) . (6) 



Standard domain decomposition methods for finding H^^ (or only its diagonal) are based 
on solving eq. (|6]), with the inverse of the Schur complement obtained from the triple matrix 
product in eq. (|5]). This expression involves several matrix multiplications plus an inversion 
of the full Schur complement matrix S. Efficient methods which drop non-significant terms 



(when only the diagonal is needed) are essential for obtaining a rapid algorithm |2ll-l23| , at 
the price of higher complexity. Recalling that the dimension of the matrix S equals that of 
C, i.e., the number of interface points m, then using more domains leads to more interface 
points, and consequently to a larger size Schur complement which is obviously more difficult 
to invert. We have gathered at this point all the necessary ingredients for developing the 
Lanczos-based low-rank algorithm. 



Combination of Lanczos with Domain Decomposition 

The Lanczos-based low-rank correction algorithm starts with an identical decomposition 
of the domain as shown on the right of Fig. [T], leading to eq. (j6]) for the inverse H~^. 
It however avoids any explicit calculation of the inverse of the Schur complement. The 
algorithm is obtained from eq. (EI) by first grouping the inverse together with the 
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blocks B ^ on the right-hand side of eq. ([6]), and consequently defining X as: 

X ^ H- - f ° ) = f ^^"^^ ] S- ((E-B-) - I) . (7) 



The fundamental idea of this new algorithm is based on the fact that the right-hand side 
of eq. ([7]) has rank m, much smaller than A^. Consequently, X in eq. ([7]), is also of rank 
m <^ N . To see this, recall that the rank of S corresponds to the number of interface points 
of the domain decomposition, related to C as defined in eq. (jl]). The matrix X has, by 
construction, the same rank as S (or C). The diagonal of the inverse of H can thus be 
determined after m <^ N Lanczos steps with the matrix X: 

X = QmTmQm) (8) 

i.e., instead of using the entire steps on the original matrix H, as in eq. (|3]). Beyond 
the m*'^ Lanczos iteration the null space is reached and qm+iPm^^m = — so the expression 
in eq. (|H]) is exact. In this situation, we have the exact following identity, after combining 
eq. dSD with eq. ([7]): 

Q^T^Q^ = ^^"^^^ j ((E^B-) - I) . (9) 
Equation (|9]) is the fundamental identity of the Lanczos-based low-rank correction algorithm. 



In any standard Lanczos implementation 18[, the matrix- vector operation given in the 



expression (jj]) must now be replaced by a "matrix-vector" operation with X, as: 

Uj ^ Xq^-. 

This in fact translates as one system solve, one inversion of small block matrices, and an 
update based on the difference defined in eq. ([7]): 

Solve Hvj- = q^., (10) 

[b ' o\ 

Compute w,- = q-, (11) 

Output: Uj -ir- Vj—Wj. (12) 

Equations (!T0l)- (fT2|) replace the usual matrix-vector operation of standard Lanczos imple- 
mentations 181] . The initial problem of finding the inverse of a matrix has thus been replaced 
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by one system solve per Lanczos step, for the matrix H in eq. (fTOj) . In the example of Fig. [T] 
on the right, only 9 out of 25 possible Lanczos steps are required on X, and consequently, 
only 9 system solves based on eq. fllOl) are necessary. Computing the block diagonals in 
eq. flTTl) is relatively straightforward, since each block has relatively small size. 
The diagonal of the inverse is finally obtained from the expression: 



deduced from eq. ([7]) and eq. ([9]) for the diagonal terms only. The diagonal of B~Ms a 
constant matrix and needs to be calculated only once at the very beginning of the Lanczos 
steps. Standard domain decomposition given by eq. and using eq. requires two 
matrix-matrix multiplications for obtaining the inverse of the Schur complement, while in 
the Lanczos scheme, the matrix is simply a tridiagonal matrix tridiag{l3m, ctm, f^m+i), 
which implies that the product of the three matrices diag[Q.i^TmQm] ^q. (fT3l) is much 
simpler, and corresponds to the following algebraic expression obtained by updating only 
three Lanczos vectors: 



The symbol stands for term-by-term vector product. 

Note that, as mentioned before in the Lanczos Section, the algorithm of Freund et al. 
[20I applied to complex-symmetric matrices potentially give null eigenvalues in the matrix 
Tm, once the m Lanczos steps are done. This fact translates as the null space being visited 
before the termination of the m Lanczos iterations. It implies that once m Lanczos iterations 
are performed, the Lanczos vector basis does not yet form a complete basis of the low-rank 
matrix X, and therefore eq. ([8]) still contains an error term that is not yet exactly null. 
We use ttj, i = 1, . . . ,m, the diagonal terms of T^, as a stopping criteria; as soon as aj 
becomes null then the null space is reached. In practice, a few extra iterations beyond m are 
allowed. In the IDMFT implementation, we find that no more than 5 or 10 extra iterations 
are required, for m of the order of 100 for instance. Other stopping criteria, such as | |q| | < e, 
using the C2 norm on the Lanczos vectors, or by eventually computing all eigenvalues of 
Tm when m is relatively small, can be used for determining when the null space is reached 
beyond the m Lanczos steps, and then stop. 




(13) 



diag (Q^T^Q^) = A(qi-i © qj + © + A+i(qi © q^+i)- 
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Note finally that the Lanczos-based low-rank algorithm is well-suited to recursive calls, 
as for any domain decomposition implementation. Various levels of recursion of the domain 
in the Lanczos-based low-rank correction method are done by alternating the number of 
domains, {dx,dy), respectively along the x and y directions, using for instance the series 
{d^, dy) = (2, 1), (1, 2), (2, 1), . . . We find that this approach is the optimal recursion, reducing 
both the number of interface points and the number of necessary recursions. 



NUMERICAL SIMULATIONS 



We evaluate next the performance of the new algorithm, applied to IDMFT for a two 



dimensional Fermi-Bose Falicov-Kimball model Hamiltonian [6|, |7|, |12|, Il3|] . In IDMFT, mul- 
tiple inversions of eq. ([1]) need to be performed, i.e., for each Matsubara frequency included 
in the calculation, iuk for A; = up to kmax, and at each self-consistency iteration of the 
IDMFT loop (where the impurity and the Dyson solutions are matched, as briefly described 
in the Introduction). More details on the parallel IDMFT algorithm and its performance 
can be found in Refs. {g, 7, 9|. Note that a "probing" method for inverting eq. ([1]), that takes 
advantage of diagonal dominancy for some matrices with large Matsubara frequencies iuk, 
has been implemented and is also discussed in Refs. {q, [2^. However, as the temperature 
parameter T of the simulation diminishes, i.e., the ultimate goal in "ultracold" simulations 
6|-|8|, the matrices tend to become indefinite. The matrices H are also mostly indefinite 
in the real-time Green's function problem. Therefore, we focus in this short paper on the 
computer performance of the Lanczos-based low-rank algorithm for the inversion of general 
matrices, including indefinite matrices. 

The original version of the IDMFT code jo] used LAPACK routines, where a decomposi- 
tion into H = LU was then completed with a full inversion, G = = U^^L^^, for finally 
extracting the diagonal. These complex-valued LAPACK routines, however, make use of 
dense matrix techniques. For the sake of comparison, we compare instead the performance 
of the Lanczos-based low-rank implementation to that of a preconditioned- GMRES 17 1 
solver that finds the solutions of the following systems of equations, using sparse techniques: 



for i = l,N, 



(14) 



where the e"s are Euclidian vectors. Each vector x^, solution of the above system, corre- 
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spends to the columns of G = H~^; the diagonal of the inverse is then extracted from the x] 
components. We note that the performance of this rather expensive algorithm, is still better 
than using the U^^L^^ decomposition from LAPACK; sparse solvers for eq. ( 1141) thusgive 
a higher bound of performance to the dense U^^L^^ decompositions previously used j6|. 

The performances of the Lanczos-based low-rank method shown next were obtained from 
calculations on a Cray XT5 supercomputer (the machine "Kraken") located at the NICS, 
as part of the Teragrid. The dimension of the matrices were set to = = 101^, where 
n represents the number of lattice sites in both the x and y directions. 63 Matsubara 
frequencies, and their corresponding matrices, are calculated (on separate processors) for 
each of the self-consistency steps. The accuracy of the final solution is one part in 10^ for 
all calculations. A preconditioned-GMRES sparse solver, based on eq. (IT^ . takes at most 
717.64 CPU seconds (~ 12 minutes) for finding one of the 63 inverses of H.{iujk), among 
all the possible inversions occuring in the entire self-consistency iteration. (It corresponds 
in fact to Uq, the smallest Matsubara frequency.) This implies that if for example 20 self- 
consistency IDMFT iterations are necessary for achieving self-consistency, then the total 
time for the IDMFT loop is about 20 x 718 = 14,360 seconds, or ~4 hours, when a method 
based on eq. ( IT^ is used. Using the Lanczos-based low-rank correction method, the exact 
same IDMFT calculation takes at most 28.66 CPU seconds for finding the same inverses. 
Thus, the 20 iterations in the IDMFT loop discussed above take a mere 20 x 29 = 580 
seconds, i.e., less than 10 minutes. This corresponds to a dramatic increase in performance 
(irrespective of the type of matrices, indefinite or diagonally dominant), i.e., at least a 25- 
fold increase in computational speed (the increase versus the dense LAPACK approach is 
much larger). 

We report in addition that the scaling with the dimension of the matrices, between 
domain decomposition with an explicit calculation of the Schur complement based on eq. (|6]), 
and the Lanczos-based low-rank correction, are equivalent {q]. They however have differ- 
ent prefactors (slightly higher in the Lanczos-based algorithm) due to different levels of 
optimization in the programs. This implies that the performance of the Lanczos-based low- 
rank method is equivalent to that of calculating explicitly the Schur complement inverses, 
a method much more complex in its implementation. This property will become important 
when considering in the near future 3-dimensional implementations of this algorithm. 
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CONCLUSIONS AND PERSPECTIVES 



The Lanczos-based low-rank correction method elegantly combines two important and 
usually distinct numerical methods for finding the diagonal of the inverse of complex sym- 
metric sparse matrices: the Lanczos iterative scheme, and the direct domain decomposition 
method. We have shown how this new algorithm avoids the explicit computation of the 
Schur complement. The algorithm transforms the inversion problem into a much smaller set 
of system solves, using a low-rank version of the original matrix to be inverted, and where the 
right-hand side vectors of the system solves are Lanczos vectors. We used a 2-dimensional 
implementation of IDMFT applied to the Fermi-Bose Falicov-Kimball model Hamiltonian 
and showed that the performance of this new algorithm corresponds to a 25-fold improve- 
ment, compared to other sparse methods. The algorithm is currently being implemented 
inside a three-dimensional IDMFT Dyson solver applied to the Hubbard model 25|], and 
combined with a continuous time quantum Monte-Carlo impurity solver to solve for system 
sizes on the order of a few million for direct comparison to experiment 26[ |. 
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